Open Street Map

Originally written by Bon Woo Koo & Subhro Guhathakurta; modified by Uijeong Hwang

2024-10-16

What is Open Street Map

Open Street Map (OSM) is a “collaborative project to create a free editable geographic database of the world” (Wikipedia). OSM is free and has been adapted widely in academia as a reliable source of data. OSM provides various types of data, ranging from raster map tiles to geographic features in vector forms. We will focus on the vector data.

Overview

This document will walk to you through:

  • How to download OSM vector data.
  • How to convert the cleaned geometry vectors into a network.
  • How to calculate centrality measures.
  • How to do route finding (shortest path).

Downloading the data

We will first define the bounding box within which we will get OSM data using nominatimlite package. It uses the Nominatim API which is a geocoding service that powers the official OSM site. We used to get Census city boundary using tigris package, but since we are learning OSM, we can get the city boundary from OSM as well. Of course, you can create your own bounding box - we do it a few code chunks below.

# Get bounding box coordinates for Atlanta
bb <- nominatimlite::geo_lite_sf('Atlanta, GA', points_only = F) %>% 
  st_bbox()

bb_sf <- bb %>% st_as_sfc()

## Plot-- 
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(bb_sf) + tm_borders()

OSM wiki states that OMS represents physical features on the ground using tags attached to its basic data structures (i.e., nodes, ways, and relations). Although there can be infinite number of attributes (i.e., tags) describing each feature, there are some key-value pairs that are agreed upon by the community and are widely used. Although it might not be accurate description, but thinking the key as a category and value as entries within the category would be reasonably accurate description. Some of the widely used keys include Amenity, Buildings, Highway, and Shop. Then there are values attached to each of the key. For Amenity, there are education, transportation, financial, healthcare, and entertainment, etc.

For Highway (which means road in OSM), the followings are some examples of the values:

  • motorway
  • trunk
  • primary
  • secondary
  • tertiary
  • residential
  • various links, including motorway_link, trunk_link, primary_link, etc.

For the full list of the key-pair values, please refer to OSMwiki. To download all possible key:value pairs for highway, you can insert available_tags("highway") instead of manually specifying a list of values. One caveat is that, using all available tags will generate a large data, significantly slowing down the processing speed. Here, we download seven values under key = ‘highway’ and convert it into sf format. The last function in the code below, osm_poly2line() is added because “Street networks downloaded with add_osm_object(key =”highway”) will store any circular highways in osm_polygons. this function combines those with the osm_lines component to yield a single sf data.frame of all highways, whether polygonal or not” (source).

# Get OSM road data
osm_road <- opq(bbox = bb) %>%
  add_osm_feature(key = 'highway', 
                  value = c("motorway", "motorway_link",
                            "trunk", "trunk_link", 
                            "primary", "primary_link",
                            "secondary", "secondary_link",
                            "tertiary", "residential")) %>%
  osmdata_sf() %>% 
  osm_poly2line()

names(osm_road)
## [1] "bbox"              "overpass_call"     "meta"             
## [4] "osm_points"        "osm_lines"         "osm_polygons"     
## [7] "osm_multilines"    "osm_multipolygons"

This object osm_road is a list that has multiple elements, including bbox, osm_points, osm_lines, osm_polygons, etc. We can extract osm_lines and see what it looks like. The prop.table() function below shows that ‘residential’ tag is roughly 70%.

## Plot--
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(osm_road$osm_lines) + tm_lines(col = "highway")

# Breakdown of highway types
round(prop.table(table(osm_road$osm_lines$highway)) * 100, 1)
## 
##       motorway  motorway_link        primary   primary_link    residential 
##            3.3            4.6            5.5            0.6           63.4 
##      secondary secondary_link       tertiary          trunk     trunk_link 
##           10.8            0.8            7.9            2.9            0.2

Defining a custom bounding box

Displaying the entire network we just downloaded can slow down R environment. For class exercise (for students to be able to follow along), we will limit the bounding box to a smaller area. We will create a custom bbox using coordinates of our selection. This technique of generating your own bounding box can also be useful if you have a specific area of interest that doesn’t overlap well with commonly used boundaries.

You need to define two points for bounding box, one point at the lower left corner and the other at the upper right corner. You can go to Google Maps, right-click on a point on map, and copy the XY coordinate. Let’s store them in p1 and p2. Notice that Google Maps will give you YX coords, but we need p1 and p2 in XY format.

# p1 is lower left corner, p2 is the upper right corner
p1 <- c(-84.42138164090896, 33.729724924925925)
p2 <- c( -84.34372362786227, 33.79761117884929)

You will then need to format the two coordinates into the following format. Then, let’s convert it into sf object.

# Custom BB
my_bb <- c(st_point(p1) %>% st_sfc(crs = 4326), 
           st_point(p2) %>% st_sfc(crs = 4326)) %>% st_bbox()

# Custom BB to sf
my_bb_sf <- my_bb %>% st_as_sfc()


# Extract a smaller network for exercise purpose
osm_small <- osm_road$osm_lines[my_bb_sf,] %>% select(osm_id, highway)

## Plot--
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(bb_sf) + tm_borders(col = "black") +  # Black = larger bbox
  tm_shape(my_bb_sf) + tm_borders(col = "red") +  # Red = smaller bbox
  tm_shape(osm_small) + tm_lines(col = "black") # Black line = small network

Converting OSM data into network using sfnetworks package

Now we have OSM network geometry. This network geometry, however, has many issues that need to be resolved before it can be used for analysis. We will use sfnetworks package to do both cleaning and analysis. We can pass osm_small to as_sfnetwork() to convert it to sfnetwork object. It is important to note that the sf object you pass to as_sfnetwork() function should be LINESTRING. When using as_sfnetwork(), it creates a directed network by default. For simplicity, we will set this to FALSE.

You can see that the output net object has two parts: nodes and edges. Also notice that the node data is denoted as ‘active’. The sfnetworks objects is made of nodes and edges that connect the nodes. When you do data manipulation using, for example, dplyr verbs such as mutate, you can choose whether to apply it to nodes or edges by activating the one that you want to modify.

# Converting the line component of OSM data into a network
net <- sfnetworks::as_sfnetwork(osm_small, directed = FALSE)
print(net)
## # A sfnetwork with 4543 nodes and 4228 edges
## #
## # CRS:  EPSG:4326 
## #
## # An undirected multigraph with 816 components with spatially explicit edges
## #
## # A tibble: 4,543 × 1
##               geometry
##            <POINT [°]>
## 1 (-84.34923 33.74581)
## 2  (-84.35038 33.7454)
## 3 (-84.34893 33.76049)
## 4 (-84.34917 33.76218)
## 5 (-84.34917 33.76184)
## 6 (-84.34892 33.76025)
## # ℹ 4,537 more rows
## #
## # A tibble: 4,228 × 5
##    from    to osm_id  highway                                           geometry
##   <int> <int> <chr>   <chr>                                     <LINESTRING [°]>
## 1     1     2 9164335 motorway_link (-84.34923 33.74581, -84.3491 33.74624, -84…
## 2     3     4 9164434 trunk_link    (-84.34893 33.76049, -84.34893 33.76072, -8…
## 3     5     6 9164438 trunk_link    (-84.34917 33.76184, -84.34909 33.7619, -84…
## # ℹ 4,225 more rows
print(paste0('Which one is active?: ', sfnetworks::active(net)))
## [1] "Which one is active?: nodes"

You can also extract nodes and edges as separate sf objects using st_as_sf() function. To extract sf objects representing edges and nodes and store them separately, you can do (1) net %>% activate("nodes") %>% st_as_sf() or, equivalently, (2) net %>% st_as_sf("nodes"). The result is the same. In the leaflet() function below which takes sf objects but not sfnetworks objects, you need to extract nodes and edges separately and pass the extraction. The yellow points are nodes and gray lines are edges.

## Plot--
leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  setView( -84.3854, 33.7618, zoom = 13) %>% 
  addPolylines(data = net %>% st_as_sf('edges'), 
               col = "gray", 
               weight = 3, 
               opacity = 0.9, 
               popup = net %>% st_as_sf('edges') %>% pull(osm_id)) %>% 
    addCircles(data = net %>% st_as_sf('nodes'), 
               fillColor = 'yellow', 
               stroke = F, 
               radius = 20, 
               fillOpacity = 0.7) 

Cleaning network

Simplifying network

This content is heavily based on this sfnetwork tutorial

A network data may contain edges that connect the same pair of nodes, called multiple edges. There can also be loops that starts and ends at the same node (e.g., think of a cul-de-sac). The former case can be detected using edge_is_multiple() and the latter using edge_is_loop().

When removing a set of multiple edges using functions shown below, they always keep the first edge and discard others. By arranging the order of edges for each set of multiple edges, you can specify which edge you want to preserve.

This way of simplification means that, when the multiple edges within a set have different attributes, all attribute information except for the preserved one would be lost. In such cases, you can merge those edges. Then, the output would have the geometry of the first edge in the set, but the attributes would be some summary of all the edges in the set. to_spatial_simple() function does this work (but not used in this document).

# Let's simplify our network
simple_net <- net %>%
  activate("edges") %>%
  filter(!edge_is_multiple()) %>%
  filter(!edge_is_loop()) 

# Because the difference is not really discernible, we just print out the differences in the edge count.
message(str_c("Before simplification, there were ", net %>% st_as_sf("edges") %>% nrow(), " edges. \n",
            "After simplification, there are ", simple_net %>% st_as_sf("edges") %>% nrow(), " edges."))
## Before simplification, there were 4228 edges. 
## After simplification, there are 4197 edges.

Subdivide edges

Why subdivide?

When as_sfnetwork() function converts an sf linestrings, the nodes are defined as the endpoints of each linestring. If you zoom into Midtown in the map above, there are many edges that crosses (i.e., intersections) but do not have nodes. The absence of nodes at what appears to be intersections would be legitimate data if the two crossing roads are, for example, bridges that have different elevations. However, we anecdotally know that there aren’t that many such overhead bridges in our study area.

We can use to_spatial_subdivision predicate to create points at the intersections and cut the edges accordingly.

“Construct a subdivision of the network by subdividing edges at each interior point that is equal to any other interior or boundary point in the edges table. Interior points in this sense are those points that are included in their linestring geometry feature but are not endpoints of it, while boundary points are the endpoints of the linestrings. The network is reconstructed after subdivision such that edges are connected at the points of subdivision.” (source)

This to_spatial_subdivision function is a part of functions called spatial morphers. Morphers are rooted in tidygraph, which allows you to temporarily morph the topology of the original network, do some work to the “morphed state” of the network, and merge the results back to the original network using unmorph() (See this document from which I borrowed heavily). It is by design that morphers are used not directly but by passing them to morph() function.

Although morphing is meant to be temporary, we can also make the changes permanent by passing them to convert() function instead of morph() function because we want to make permanent changes to our network (i.e., create nodes at intersections), we will use convert().

# # Using spatial morpher
subdiv_net <- convert(simple_net, sfnetworks::to_spatial_subdivision)
## Warning: to_spatial_subdivision assumes attributes are constant over geometries
# -------------------------------------------
# Below is done for visualization purpose
subdiv_net <- subdiv_net %>% 
  activate("nodes") %>% 
  mutate(custom_id = seq(1, subdiv_net %>% st_as_sf("nodes") %>% nrow()),
         is.new = case_when(is.na(.tidygraph_node_index) ~ "new nodes",
                            TRUE ~ "existing nodes"),
         is.new = factor(is.new)) 

## Plot--
subdiv_pal <- colorFactor(palette = c("yellow", "red"), domain = subdiv_net %>% st_as_sf("nodes") %>% pull(is.new))

subdiv_map <- leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  setView( -84.3854, 33.7618, zoom = 13) %>%
  addPolylines(data = subdiv_net %>% st_as_sf('edges'), col = "grey", weight = 3, opacity = 0.9) %>% 
  addCircles(data = subdiv_net %>% st_as_sf('nodes'), fillColor = ~subdiv_pal(is.new), stroke = F, radius = 20, fillOpacity = 0.7) %>% 
  addLegend(position = "bottomright", pal = subdiv_pal, values = subdiv_net %>% st_as_sf("nodes") %>% pull(is.new)) 

subdiv_map

Delete pseudo nodes

Notice that there are nodes on edges that are unnecessary nodes called pseudo nodes. For directed networks, pseudo nodes are those that have only one incoming and one outgoing edge. For undirected networks, pseudo nodes are those that have two incident edges. If these edges connected by these pseudo nodes are identical in their attributes, these nodes are unnecessary in any ways. These nodes can be delete using to_spatial_smooth predicate.

# Using spatial morpher
smoothed_net <- convert(subdiv_net, sfnetworks::to_spatial_smooth)

# -------------------------------------------
# Below is done for visualization purpose
# Extract removed points for mapping purposes
removed <- setdiff(subdiv_net %>% st_as_sf('nodes') %>% pull(custom_id), 
                   smoothed_net %>% st_as_sf('nodes') %>% pull(custom_id))

smooth_map <- leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  setView( -84.3854, 33.7618, zoom = 13) %>%
  addPolylines(data = smoothed_net %>% st_as_sf('edges'), col = "grey", weight = 3, opacity = 0.9, popup = smoothed_net %>% st_as_sf('edges') %>% rownames()) %>% 
    addCircles(data = smoothed_net %>% st_as_sf('nodes'), fillColor = 'yellow', stroke = F, radius = 20, fillOpacity = 0.7, group = "kept") %>% 
    addCircles(data = subdiv_net %>% st_as_sf("nodes") %>% filter(custom_id %in% removed), 
               fillColor = "red", stroke = F, radius = 15, fillOpacity = 0.4, group = "removed") %>% 
  addControl(html = htmltools::HTML("<b>Red points denote removed nodes</b>"), position = "bottomright") %>% 
  addLayersControl(overlayGroups = c("kept", "removed"))

smooth_map

Calculate centrality

One of the most common ways for analyzing the characteristics of networks are centrality measures. These algorithms assigns some numbers to nodes within a graph based on their position in the given network. There are many centrality measures, each with their own characteristics, as illustrated in the images below. We will try betweenness centrality and degree centrality.

Betweenness centrality measures “the percentage of shortest paths that must go through the specific node (Golbeck 2005, p.229)”.

Degree centrality is the simplest centrality measure to compute: it is a count of how many connections a node has (Golbeck 2005, p.226).

(Source: wikipedia)

(Source: Figure 1 in Cordeiro et al. (2018))

Betweenness centrality

# Calculate centrality measures
network_char <- smoothed_net %>% 
  activate("edges") %>%
  mutate(weight = edge_length()) %>% 
  mutate(edge_bc = centrality_edge_betweenness(weights = weight, directed = F)) %>%
  activate("nodes") %>% 
  mutate(node_bc = centrality_betweenness(weights = weight, directed = F))

# Edge betweenness
bet_pal_edge <- colorNumeric(palette = "inferno", domain = network_char %>% activate("edges") %>% pull(edge_bc), n = 10)

# Node betweenness
bet_pal_node <- colorNumeric(palette = "inferno", domain = network_char %>% activate("nodes") %>% pull(node_bc), n = 10)


# Map
leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>%
  setView( -84.3854, 33.7618, zoom = 13) %>%
  addPolylines(data = network_char %>% st_as_sf("edges"), 
               color = ~bet_pal_edge(network_char %>% st_as_sf('edges') %>% pull(edge_bc)), weight = 3, opacity = 0.9) %>% 
  addCircles(data = network_char %>% st_as_sf("nodes"), 
               fillColor = ~bet_pal_node(network_char %>% st_as_sf('nodes') %>% pull(node_bc)), stroke = F, fillOpacity = 0.9, 
             radius = network_char %>% st_as_sf("nodes") %>% with(.$node_bc/(max(.$node_bc)/100))) # denominator is selected to make the max value roughly equal to 100

Degree centrality

Intersections are used in various measures of the built environment. For example, intersection density (number of intersections divided by the area) is used to calculate a walkability index (Frank et al., 2005) as well as in the well-known 5D framework (Ewing and Cervero, 2010). Remember that (unweighted) degree centrality simply means the number of edges each node has. Because we have cleaned the network by dropping pseudo nodes, the intersection would be those nodes with degree centrality greater than 1.

intersections <- smoothed_net %>%
  st_transform(crs = 4326) %>% 
  activate("nodes") %>% 
  mutate(degree = centrality_degree(weights = NULL)) %>% 
  filter(degree > 1)

color_palette <- colorFactor(
  palette = c("#878686", "#ffffff", "#ffed78", "#ff7878"),
  domain = intersections %>% st_as_sf("nodes") %>% .$degree
)

leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  setView( -84.3854, 33.7618, zoom = 13) %>%
  addPolylines(data = smoothed_net %>% st_as_sf('edges'), col = "grey", weight = 3, opacity = 0.9) %>% 
    addCircles(data = intersections %>% st_as_sf('nodes'), 
               fillColor = ~color_palette(degree), 
               stroke = F, 
               radius = ~degree * 10, 
               fillOpacity = 0.9)

Shortest path

One of the most widely performed task with road networks is to find a route between A and B, often the shortest path. IT can be easily done using sfnetworks package’s st_network_paths().

First, you need to create sf objects with POINT geometry that represent origin and destination. These points do not need to be ON the network - the sf_network_paths() will find the nearest node and use them for the calculation. You can also supply multiple destinations to a single call of st_network_paths(), but the origin has to be the same. Then, you can pass them to st_network_paths() along with the network. The output is a data frame (a tibble to be more specific) with two columns, node_paths and edge_paths. This data frame will have as many rows as the number of origin-destination pairs. Opening up the content of the data frame, you will notice that it is just a sequence of nodes and edges through which the shortest path runs.

# Change crs for convenience
smoothed_net <- smoothed_net %>% st_transform(4326)

# Start point
start_p <- nominatimlite::geo_lite_sf('Georgia Tech CRC') %>% st_geometry() # same as `pull(gemetry)`
# End point
target_p1 <- nominatimlite::geo_lite_sf('Atlanta Botanical Garden') %>% st_geometry()
target_p2 <- nominatimlite::geo_lite_sf('Krog Street Market') %>% st_geometry()

# Get the shortest path
paths <- st_network_paths(smoothed_net, from = start_p, to = c(target_p1, target_p2), type = "shortest")

# Extract shortest path
paths_sf1 <- smoothed_net %>%
  activate("nodes") %>% 
  slice(paths$node_paths[[1]])

paths_sf2 <- smoothed_net %>%
  activate("nodes") %>% 
  slice(paths$node_paths[[2]])


# Visualize
leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  setView( -84.3854, 33.7618, zoom = 13) %>%
  # Base network in grey
  addPolylines(data = smoothed_net %>% st_as_sf("edges"), color = 'grey', weight = 2, opacity = 0.7) %>% 
  
  # Chosen edges
  addPolylines(data = paths_sf1 %>% st_as_sf("edges"), color = "pink", weight = 10, opacity = 0.3) %>% 
  # Chosen nodes
  addCircles(data = paths_sf1 %>% st_as_sf("nodes"), stroke = F, fillColor = "pink", fillOpacity = 0.8, radius = 50) %>% 
  
  # Chosen edges
  addPolylines(data = paths_sf2 %>% st_as_sf("edges"), color = "lightgreen", weight = 10, opacity = 0.3) %>% 
  # Chosen nodes
  addCircles(data = paths_sf2 %>% st_as_sf("nodes"), stroke = F, fillColor = "lightgreen", fillOpacity = 0.8, radius = 50)

Using GTFS and OSM together

Although GTFS and OSM are different data sources and may seem to reside in totally separate world, people in the real world frequently mix the two as they navigate in a city. For example, one might drive from home to the nearest MARTA rail transit station, park their car there, and ride transit to their work. To analyze such movement, we need to be able to somehow connect the two datasets.

For exercise, let’s examine how to compute the shortest path from one location to MARTA rail stations. Since we are only interested in rail stations, we first need to extract transit stops for rail. As we discussed last week, GTFS is a relational table. The information on whether a transit facility is for bus or rail is contained in routes table, but it is not possible to link stops table and routes table directly. We will join routes -> trips -> stop_times -> stops. Then, we will extract the stop_id of rail stations.

# Read GTFS
transit <- tidytransit::read_gtfs("https://www.itsmarta.com/google_transit_feed/google_transit.zip") %>% 
  tidytransit::gtfs_as_sf()

# Get stop_id for rails
rail_stops <- transit$routes %>% 
  filter(route_type %in% c(0,1,2)) %>% 
  inner_join(transit$trips, by = "route_id") %>% 
  inner_join(transit$stop_times, by = "trip_id") %>% 
  inner_join(transit$stops, by = "stop_id") %>% 
  group_by(stop_id) %>% 
  slice(1) %>% 
  pull(stop_id)

tm_shape(transit$stops %>% filter(stop_id %in% rail_stops)) + tm_dots(col = "red")

Now, we need the city-wide OSM network that we downloaded at the beginning.

# Using the full network file 
atl_net <- osm_road$osm_lines %>%
  select(osm_id, highway) %>% 
  sfnetworks::as_sfnetwork(directed = FALSE) %>% 
  activate("edges") %>%
  filter(!edge_is_multiple()) %>%
  filter(!edge_is_loop()) %>% 
  convert(., sfnetworks::to_spatial_subdivision) %>% 
  convert(., sfnetworks::to_spatial_smooth)
## Warning: to_spatial_subdivision assumes attributes are constant over geometries

Next, the remaining steps are:

  1. Define the starting point.
  2. Define the ending points (which are MARTA rail stations).
  3. Fine the shortest paths from the starting point to all ending points (i.e., stations).
  4. Calculate the distance of each route between the starting point to each MARTA rail stations.
  5. Replace those with 0 distances with some large number (e.g., maximum distance). These are those for which the algorithm could not find connecting routes.
  6. Find the one with the minimum distance.
  7. Map it!
# Change crs for convenience
atl_net <- atl_net %>% st_transform(4326) %>% activate("edges") %>% mutate(length = edge_length())

# Step 1. Start point
start_p <- nominatimlite::geo_lite_sf("IKEA Atlanta") %>% st_geometry() # I chose IKEA

# Step 2. End point
rail_stops_point <- transit$stops %>% filter(stop_id %in% rail_stops) %>% st_geometry()

# Step 3. Get the shortest path
paths <- st_network_paths(atl_net, from = start_p, to = rail_stops_point, type = "shortest")

# Step 4. Find the distances
dist_all <- map_dbl(1:nrow(paths), function(x){
  atl_net %>% activate("nodes") %>% slice(paths$node_paths[[x]]) %>% st_as_sf("edges") %>% pull(length) %>% sum()
}) %>% unlist() 

# Step 5. Replace zeros with a large values
if (any(dist_all == 0)){
  dist_all[dist_all == 0] <- max(dist_all)
}

# Step 6. Find the smallest one
closest_index <- which.min(dist_all)

# Extract shortest path
paths_closest <- atl_net %>%
  activate("nodes") %>% 
  slice(paths$node_paths[[closest_index]])

# Selected station
closest_station <- transit$stops %>% filter(stop_id %in% rail_stops) %>% slice(closest_index)

# Step 7. Map it!
leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  # Base network in grey
  addPolylines(data = atl_net %>% st_as_sf("edges"), color = 'grey', weight = 2, opacity = 0.7) %>% 
  
  # Chosen edges
  addPolylines(data = paths_closest %>% st_as_sf("edges"), color = "red", weight = 10, opacity = 0.5) %>% 
  # Chosen nodes
  addCircles(data = paths_closest %>% st_as_sf("nodes"), stroke = F, fillColor = "red", fillOpacity = 0.8, radius = 50) %>% 
  # Chosen station
  addCircles(data = closest_station, fillColor = "blue", radius = 300)